Skip to content

fix(opensearch): make shadow write log level configurable (#35302)#35375

Closed
fabrizzio-dotCMS wants to merge 4 commits intomainfrom
fix/issue-35302-shadow-write-log-level
Closed

fix(opensearch): make shadow write log level configurable (#35302)#35375
fabrizzio-dotCMS wants to merge 4 commits intomainfrom
fix/issue-35302-shadow-write-log-level

Conversation

@fabrizzio-dotCMS
Copy link
Copy Markdown
Member

Summary

  • Introduces DOTCMS_SHADOW_WRITE_LOG_LEVEL config property (default WARN) so operators can tune the visibility of OS shadow write failures in dual-write phases without code changes
  • Migrates all 6 fire-and-forget log sites (PhaseRouter ×3, BulkProcessorListener ×2, ContentletIndexAPIImpl ×1) to the new centralised IndexConfigHelper.logShadowWriteFailure() helper
  • Adds PhaseRouterTest with 12 unit tests covering the mismatched-index-name scenario (ES has working_T0, OS has working_T1) across all four migration phases

Usage

# dotmarketing-config.properties (or -D flag)
DOTCMS_SHADOW_WRITE_LOG_LEVEL=ERROR   # surface shadow failures in alerts during QA
DOTCMS_SHADOW_WRITE_LOG_LEVEL=DEBUG   # suppress during steady-state migration
# default: WARN

Test plan

  • ./mvnw test -pl :dotcms-core -Dtest=PhaseRouterTest — 12/12 pass
  • ./mvnw compile -pl :dotcms-core -DskipTests — compiles clean
  • Manual: set DOTCMS_SHADOW_WRITE_LOG_LEVEL=ERROR, trigger a Phase 1 shadow write failure, confirm ERROR appears in logs
  • Manual: set DOTCMS_SHADOW_WRITE_LOG_LEVEL=DEBUG, confirm shadow failures are silent at WARN level

Closes #35302

🤖 Generated with Claude Code

…e-and-forget tests (#35302)

Introduces DOTCMS_SHADOW_WRITE_LOG_LEVEL (default WARN) so operators can raise
shadow OS write failures to ERROR during QA or lower them to DEBUG during
steady-state migration, without code changes. Adds PhaseRouterTest covering
all four migration phases with the mismatched-index-name scenario.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented Apr 17, 2026

Claude finished @fabrizzio-dotCMS's task in 2m 58s —— View job


PR Review

  • Read changed files
  • Analyze for correctness, edge cases, design issues
  • Post review findings

Findings

1. INFO level silently drops the Throwable — IndexConfigHelper.java:62

case "INFO":  Logger.info(clazz, message);    break;   // t is silently dropped

Logger.info has no Throwable overload (confirmed: only info(Class, String) exists). All other cases (DEBUG, WARN, ERROR) pass t correctly. The Javadoc says @param t the throwable, or null if none, implying callers expect it to be used. An operator who sets DOTCMS_SHADOW_WRITE_LOG_LEVEL=INFO to reduce noise will silently lose stack traces for non-null throwables — with no warning. Four of the six call sites pass a non-null t.

Either document this in the Javadoc explicitly, or include the exception message inline:

case "INFO": Logger.info(clazz, t != null ? message + " — " + t.getMessage() : message); break;

Fix this →


2. writeBoolean sentinel primaryResult = truePhaseRouter.java:238

boolean primaryResult = true;  // sentinel: optimistic default

The default value true is never observed in practice (primary is always in the providers list), but it's a misleading sentinel. If writeProviders() ever returns a list that doesn't include readProvider() (e.g., a future phase), the method silently returns true instead of throwing. false would be a safer pessimistic default, or use a Boolean (nullable) and throw if never set.


3. No test coverage for the log-level dispatch itself

PhaseRouterTest has 12 tests covering routing semantics, but none verify that logShadowWriteFailure actually routes to the correct level, or that the INFO case does/doesn't include the Throwable. This is the core behavioral change of the PR and it has no direct tests.


4. Config re-read on every logShadowWriteFailure call

final String level = Config.getStringProperty(SHADOW_WRITE_LOG_LEVEL_KEY, "WARN").toUpperCase();

In a bulk-processor failure storm (e.g., OS unavailable during a full reindex), this method could be called thousands of times per batch. Config.getStringProperty is not free. The hot-reload flexibility is nice, but consider whether it's worth the overhead on a hot failure path. If hot-reload isn't a requirement, cache the value. If it is, that should be documented.


5. queueApi = null in the full test constructor — ContentletIndexAPIImpl.java:215

this.queueApi = null; // not needed for the methods under test

If a future test exercises a path that calls queueApi.deleteReindexEntry() or queueApi.markAsFailed(), it gets a NPE with no useful message. Consider a fail-fast alternative (null object or explicit Objects.requireNonNull guard in the methods that use it, or a mock that throws a descriptive UnsupportedOperationException).


6. TOCTOU on phase reads (pre-existing, but amplified by this PR)

Several methods call MigrationPhase.current() multiple times within one operation — e.g., writeChecked calls writeProviders() and then readProvider() separately; isPhase2() calls isReadEnabled() and !isMigrationComplete(). Each call re-reads the config. A phase transition concurrent with a write operation could observe two different phases, routing to incorrect providers. This pre-dates this PR but the new tests would be a good place to document the assumption (phase is stable for the duration of a write).


Summary: The INFO/Throwable issue (point 1) is the clearest correctness gap introduced by this PR. Points 2, 5 are minor fragility concerns. Points 3 and 4 are worth a follow-up.

fabrizzio-dotCMS and others added 3 commits April 20, 2026 13:08
…entletIndexAPIImpl

Adds ContentletIndexAPIImplPhaseTest (unit, full mock injection) and
ContentletIndexAPIImplMigrationIT (integration, requires two real clusters)
covering deactivateIndex, createContentIndex, closeIndex, and activateIndex
across all four migration phases. Includes a full-dependency constructor on
ContentletIndexAPIImpl for isolated unit testing without APILocator.
Registers ContentletIndexAPIImplMigrationIT in OpenSearchUpgradeSuite.

Refs: #35302

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ContentIndex

addCustomMapping was called unconditionally after a failed index creation, which
could apply mappings to a non-existent index and surface an unrelated exception
instead of the clean false-return the soft-failure contract promises.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@fabrizzio-dotCMS fabrizzio-dotCMS deleted the fix/issue-35302-shadow-write-log-level branch April 20, 2026 19:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

AI: Safe To Rollback Area : Backend PR changes Java/Maven backend code

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

[BUG] OS exceptions propagate to client in Phase 1 dual-write (should be fire-and-forget)

1 participant